[SPARK-25366][SQL]Zstd and brotli CompressionCodec are not supported for parquet files#22358
10110346 wants to merge 1 commit into apache:master
Conversation
But if the codecs are found, we support those compressions, no?
docs/sql-programming-guide.md
Outdated
I prefer none, uncompressed, snappy, gzip, lzo, brotli(need install ...), lz4, zstd(need install ...).
Installation alone may not solve it.
none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)
https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126
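For reference, the option under discussion is configured as below; a minimal sketch, with the caveats taken from the links above (`snappy`/`gzip` work out of the box, the other values depend on the classpath):

```
# conf/spark-defaults.conf (illustrative)
# zstd needs ZStandardCodec from Hadoop 2.9.0+ (HADOOP-13578);
# brotli needs the external brotli-codec jar on the classpath.
spark.sql.parquet.compression.codec  snappy
```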
Is hadoop-2.9.x officially supported in Spark?
It uses reflection to acquire the Hadoop compression codec classes, which are not present in hadoop-common-2.6.5.jar, hadoop-common-2.7.0.jar, or hadoop-common-3.1.0.jar.
Thanks. If the codecs are found, we support those compressions, but how do I find out whether they are found? @HyukjinKwon
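One way to answer the "how do I find it" question is to probe the classpath for the codec classes directly. A minimal sketch, assuming it is run with the same classpath as the Spark driver/executors; `CodecCheck` and `isLoadable` are hypothetical names, and the class names are the ones from the error messages in this PR:

```java
public class CodecCheck {
    /** Returns true if the named class can be loaded on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] codecs = {
            "org.apache.hadoop.io.compress.ZStandardCodec", // zstd (Hadoop 2.9.0+)
            "org.apache.hadoop.io.compress.BrotliCodec"     // brotli (external jar)
        };
        for (String name : codecs) {
            System.out.println(name + (isLoadable(name) ? ": available" : ": not found"));
        }
    }
}
```

If a codec prints "not found" here, setting `spark.sql.parquet.compression.codec` to it would fail at write time.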
That's probably something we should document, or we should improve the error message. Ideally, we should fix the error message from Parquet. Don't you think?
Yeah, the error message comes from an external jar (parquet-common-1.10.0.jar).
Test build #95785 has finished for PR 22358 at commit
If the codecs are found, then we support them. One thing we might do is document that users must provide the codec explicitly, but I am not sure how many users are confused by this.
Just FYI, related discussion: #21070 (comment)
I thought that if you remove it from here, the user would not be able to use zstd or brotli even if it is installed/enabled/available?
I agree with you, removing is not a good idea.
Thanks.
(force-pushed from 1db036a to 5c478b9)
Test build #95852 has finished for PR 22358 at commit
I am 0 on this since it is worth
docs/sql-programming-guide.md
Outdated
I would just add a few lines for brotli and zstd below and leave the original text as is.
(force-pushed from 5c478b9 to dd86d3f)
Test build #95930 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
needs install -> needs to install
I'm okay with it, but I would close this if no committer approves it for a long time.
(force-pushed from dd86d3f to 64aef6b)
Test build #95969 has finished for PR 22358 at commit
docs/sql-programming-guide.md
Outdated
@HyukjinKwon How about adding a link? Users may not know where to download it.
`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)
If the link is expected to be reasonably permanent, it's fine.
It is clearer to say "zstd requires ZStandardCodec to be installed".
(force-pushed from 64aef6b to 39eaf1d)
docs/sql-programming-guide.md
Outdated
Test build #96312 has finished for PR 22358 at commit
(force-pushed from 39eaf1d to 0e5d0bc)
Test build #96314 has finished for PR 22358 at commit
srowen left a comment:
I think a bit of documentation is OK.

What changes were proposed in this pull request?
Hadoop 2.6 and Hadoop 2.7 do not contain the zstd and brotli compression codecs, and Hadoop 3.1 contains only the zstd codec.
So I think we should remove zstd and brotli for the time being.
Setting spark.sql.parquet.compression.codec=brotli:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.BrotliCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
Setting spark.sql.parquet.compression.codec=zstd:
Caused by: org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.hadoop.io.compress.ZStandardCodec was not found
at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:235)
at org.apache.parquet.hadoop.CodecFactory$HeapBytesCompressor.<init>(CodecFactory.java:142)
at org.apache.parquet.hadoop.CodecFactory.createCompressor(CodecFactory.java:206)
at org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:189)
at org.apache.parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:153)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:411)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:161)
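The stack traces above come from Parquet resolving the configured codec name to a Hadoop class reflectively and wrapping the failure. A minimal pure-Java sketch of that failure mode; `CodecLookupSketch` is a hypothetical name, and Parquet's real CodecFactory wraps the failure in its own BadConfigurationException rather than a plain RuntimeException:

```java
public class CodecLookupSketch {
    // Sketch of the reflective lookup that fails above: the codec name is
    // mapped to a Hadoop class name and loaded via Class.forName; a missing
    // class is rethrown with the familiar "Class ... was not found" message.
    static Class<?> getCodecClass(String codecClassName) {
        try {
            return Class.forName(codecClassName);
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(
                "Class " + codecClassName + " was not found", e);
        }
    }

    public static void main(String[] args) {
        try {
            getCodecClass("org.apache.hadoop.io.compress.BrotliCodec");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```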
How was this patch tested?
Existing unit tests.